 spatial environment


Generic Multimodal Spatially Graph Network for Spatially Embedded Network Representation Learning

arXiv.org Artificial Intelligence

Spatially embedded networks (SENs) represent a special type of complex graph, whose topologies are constrained by the networks' embedded spatial environments. The graph representation of such networks is thereby influenced by the embedded spatial features of both nodes and edges. Accurately representing the graph structure and graph features is fundamental to many graph-related tasks. In this study, a Generic Multimodal Spatially Graph Convolutional Network (GMu-SGCN) is developed for efficient representation of spatially embedded networks. The developed GMu-SGCN model can learn the node connection pattern via multimodal node and edge features. To evaluate the developed model, a river network dataset and a power network dataset have been used as test beds. The river network represents the naturally developed SENs, whereas the power network represents a man-made network. Both types of networks are heavily constrained by the spatial environments and uncertainties from nature. Comprehensive evaluation analysis shows the developed GMu-SGCN can improve accuracy of the edge existence prediction task by 37.1% compared to a GraphSAGE model that considers only the node's position feature in a power network test bed. Our model demonstrates the importance of considering the multidimensional spatial feature for spatially embedded network representation.
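To make the idea of "multimodal node and edge features" concrete, the following is a minimal sketch of how spatial features of two nodes and of the candidate edge between them might be combined into one input for an edge existence score. It is not the GMu-SGCN architecture, which the abstract does not describe in detail. The function names `edge_features` and `edge_score`, the choice of Euclidean length as the edge feature, and the logistic scorer are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def edge_features(
    pos_u: Sequence[float],
    pos_v: Sequence[float],
    attr_u: Sequence[float],
    attr_v: Sequence[float],
) -> List[float]:
    """Build one feature vector for a candidate edge (u, v).

    Hypothetical layout: both node positions, both node attribute
    vectors, then a single edge-level spatial feature (Euclidean length).
    """
    length = math.dist(pos_u, pos_v)  # edge-level spatial feature (assumed)
    return [*pos_u, *pos_v, *attr_u, *attr_v, length]


def edge_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic score in (0, 1); a stand-in for a learned predictor."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy usage: two nodes with 2-D positions and one scalar attribute each.
feats = edge_features((0.0, 0.0), (3.0, 4.0), [1.0], [2.0])
score = edge_score(feats, weights=[0.0] * len(feats), bias=0.0)
```

A position-only baseline in this sketch would drop `attr_u`, `attr_v`, and the length term, which is the kind of comparison the abstract draws against GraphSAGE.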


Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

arXiv.org Artificial Intelligence

Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, local and depth image information, which previous works have ignored, is crucial for understanding the spatial environment. To address these issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal component takes both the RGB and depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale component seeks to model the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and depth images into patches and adopt the Gemini-generated environment captions to guide the local spatial understanding. After that, the multi-modal and multi-scale features are integrated by the local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation. The code and audio samples are available at: https://github.com/AI-S2-Lab/M2SE-VTTS.
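The patch-splitting step mentioned in the abstract can be illustrated with a small sketch. It assumes non-overlapping square patches and images stored as nested lists; the actual patch size, image encoders, and caption guidance used by M2SE-VTTS are not given here, and `split_into_patches` is a hypothetical helper, not a function from the linked repository.

```python
from typing import Any, List


def split_into_patches(image: List[List[Any]], patch: int) -> List[List[List[Any]]]:
    """Split an H x W grid into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


# Toy 4 x 4 inputs: RGB pixels as tuples, depth values as floats.
rgb = [[(r, c, 0) for c in range(4)] for r in range(4)]
depth = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

rgb_patches = split_into_patches(rgb, 2)
depth_patches = split_into_patches(depth, 2)

# Pair each RGB patch with the depth patch covering the same region.
paired = list(zip(rgb_patches, depth_patches))
```

Using the same grid for both modalities keeps each RGB patch aligned with its depth patch, so a later local-understanding stage can treat the pair as one spatial region.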


Understanding spatial environments from images

Science

The ability to understand spatial environments based on visual perception is arguably a key function of the cognitive system of many animals, including mammals and others. A common presumption about artificial intelligence is that its goal is to build machines with a similar capacity of "understanding." The research community in artificial intelligence, however, has settled on a more pragmatic approach. Instead of attempting to model or quantify understanding directly, the objective is to construct machines that merely solve tasks that seem to require understanding. Understanding can only be measured indirectly, for example, by analyzing the ability of a system to generalize to solving new tasks, which is sometimes called transfer learning (1).


Cities Will Have to Be Redesigned to Confuse Invading Robots

#artificialintelligence

One of the most remarkable details of a fatal collision earlier this month involving a tractor trailer and a Tesla electric car operating in self-driving mode was the fact that the car apparently mistook the side of the truck for the sky. As Tesla explained in a public statement following the accidental death, the car's autopilot was unable to see "the white side of the tractor trailer against a brightly lit sky"--which is to say, it was unable to differentiate the two. The truck was not perceived as a discrete object, in other words, but as something indistinguishable from the larger spatial environment. It was more like an elision, a continuation of the sky by deceptive means. Examples like this are tragic, to be sure, but they are also technologically interesting, in that they give momentary glimpses of where robotic perception has failed.


A Model Attention and Selection Framework for Estimation of Many Variables, with Applications to Estimating Object States in Large Spatial Environments

AAAI Conferences

Robots performing service tasks such as cooking and cleaning in human-centric environments require knowledge of certain environmental states in order to complete tasks successfully. While much effort has gone into developing various estimators for deriving distributions on values of unknown states, less attention has been placed on why the particular estimation problem arises. In this work, I argue that state estimation should no longer be treated as a black box. Estimating large sets of variables is computationally costly; just because a technique exists to estimate the values of certain variables does not justify its application. For robots whose ultimate mission is to complete tasks, only variables that are relevant to successful completion should be estimated. I propose to initially track only a minimal set of directly relevant variables (attention), and to gradually increase the sophistication of models on demand (refinement), in a local fashion. This estimator refinement process is triggered by violations in expectations of task success (mismatch). This model selection framework is demonstrated through a proof-of-concept case study.
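The attention, refinement, and mismatch loop described above can be sketched as a small control structure. This is only an illustration under stated assumptions: the class `AttentiveEstimator`, the variable names, and the rule that a mismatch adds the next queued set of variables are all hypothetical, and the sketch does not perform any actual state estimation.

```python
from typing import Iterable, List, Set


class AttentiveEstimator:
    """Tracks a minimal variable set and widens it only after a mismatch."""

    def __init__(self, minimal_vars: Iterable[str], refinements: Iterable[Iterable[str]]):
        # Attention: start with the directly relevant variables only.
        self.tracked: Set[str] = set(minimal_vars)
        # Refinement: ordered queue of extra variable sets to add on demand.
        self.refinements: List[Set[str]] = [set(r) for r in refinements]

    def step(self, expected_success: bool, observed_success: bool) -> bool:
        """Report whether a mismatch occurred; refine the model if so."""
        mismatch = expected_success and not observed_success
        if mismatch and self.refinements:
            self.tracked |= self.refinements.pop(0)
        return mismatch


# Hypothetical service task: fetching a cup.
est = AttentiveEstimator({"cup_location"}, [{"cup_fill_level"}, {"cup_temperature"}])
first = est.step(expected_success=True, observed_success=True)    # no mismatch
second = est.step(expected_success=True, observed_success=False)  # triggers refinement
```

Keeping the refinement queue explicit means the cost of estimating extra variables is paid only after the simpler model has failed to predict task success.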